A maximum smoothed likelihood estimator in the current status 

continuous mark model 

Piet Groeneboom, Geurt Jongbloed & Birgit Witte 
September 7, 2011 



Address: 

Piet Groeneboom & Geurt Jongbloed 

Delft University of Technology 

Delft Institute of Applied Mathematics 

Mekelweg 4 

2628 CD Delft 

The Netherlands 



Birgit I. Witte (Corresponding author) 

VU University Medical Center 

Department of Epidemiology and Biostatistics 

PO Box 7057 

1007 MB Amsterdam 

The Netherlands 

B . WitteOvumc . nl 



Abstract 



We consider the problem of estimating the joint distribution function of the event time and a 
continuous mark variable based on censored data. More specifically, the event time is subject to 
current status censoring and the continuous mark is only observed in case inspection takes place 
after the event time. The nonparametric maximum likelihood estimator (MLE) in this model 
is known to be inconsistent. We propose and study an alternative likelihood based estimator, 
maximizing a smoothed log-likelihood, hence called a maximum smoothed likelihood estimator 
(MSLE). This estimator is shown to be well defined and consistent, and a simple algorithm is 
described that can be used to compute it. The MSLE is compared with other estimators in a 
small simulation study. 
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1 Introduction 



In survival analysis one is interested in the distribution of the time X it takes before a certain event 
(failure, onset of a disease) takes place. Typically, the variable X is not observed completely, due 
to some sort of censoring. Depending on the censoring mechanism and the precise assumptions 
imposed on the distribution function Fq of X, many estimators have been defined and studied in 
the literature. 



In the context of case I interval censoring, |Groeneboom and Wellner (1992) study the (nonpara- 
metric) maximum likelihood estimator (MLE). It maximizes the likelihood of the observed data 
over all distribution functions, without any additional constraints. In case X is subject to right- 
censoring, the MLE is the Kaplan-Meier estimator (Kaplan and Meier (1958)). In these models, 
the resulting estimators are piecewise constant between jumps, and therefore fail to have a density 
w.r.t. Lebesgue measure. 

If the quantity of interest is bivariate, (X, Y), with joint distribution function Fq, the situation 
is more complicated. If both components X and Y of the pair (X, Y) are subject to right censoring, 
the MLE is inconsistent, see Tsai et al. (1986) , and modifications of the MLE to ensure consistency 
have been discussed by several authors, see, e.g., Dabrowska (1988) and van der Laan (1996) In 



case both X and Y are subject to interval censoring, the MLE is consistent, see, e.g.,|Song (2001) 



In computing the MLE, one first has to determine the set of points where the MLE can have mass 
(which is different from the set of observations) . |Maathuis (2005) provides two reduction algorithms 
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that can be used to determine this set. The reason that the MLE is consistent for the bivariate 
current status model, and inconsistent for the bivariate right censoring model (where one has in 
fact more information) , is that in the latter case the MLE only uses the information on "lines" , if 
the observation is uncensored in one coordinate, and does not use the surrounding information for 
the uncensored coordinate. One would need information on the conditional distribution on these 
lines to distribute mass in such a way that a consistent estimate would result, but this conditional 
distribution is not available, since it is part of the estimation problem. In the current status 
model or the interval censoring model with more observation times for the "hidden variable" , one 
only has information on the interval to which the hidden variable belongs, and the MLE therefore 
automatically uses the surrounding information. For this reason a reduction of the bivariate right- 
censoring model to the interval censoring model has been proposed to obtain consistent estimators 
of the bivariate distribution function: in this way the information of a whole set of lines is combined. 

An interesting situation arises when X is a survival time and Y a (continuous) mark variable. 
In case X is subject to right-censoring and Y is only observed if X is observed, |Huang and Louis 



(1998) study a nonparametric estimator of the bivariate distribution function Fq. This estimator 



is uniformly strongly consistent and asymptotically normally distributed. Hudgens et al. (2007) 



study several estimators for the joint distribution of a survival time and a continuous mark variable, 
when the survival time is interval censored and the mark variable is possibly missing. In this paper 



a computational algorithm for the MLE is proposed, but since the MLE is inconsistent (Maathuis 



and Wellner (2008) ), one would be inclined to recommend not to use this estimator. 



The model we focus on in this paper, the current status continuous mark (CSCM) model, is 
a special case of the model studied in the latter paper, since the X is subject to current status 
censoring, the simplest case of interval censoring, and Y is only observed if the event time was 
before the censoring time. More precisely, instead of observing (X,Y), we observe a variable T, 
independent of (X, Y), as well as the variable A = l{x<r}- I n case the variable X is smaller than 
or equal to T, i.e. A = 1, we also observe the variable Y, in case A = we do not. We denote the 
(bivariate) distribution function of (X, Y) by Fq and assume it to have a density fo w.r.t. Lebesgue 
measure. Because P{Y = 0) = under Fq, we can represent the observable information on (X,Y) 
in the vector W = (T, A • Y). 

An application where observations can be modeled by this model is the HIV vaccine trial studied 
by Hudgens et al. (2007) In these HIV vaccine trials, participants are injected with a vaccine and 
tested for infection with HIV during several follow-ups. Efficacy of the vaccine might depend on 
the genetic sequence of the exposing virus, and the so-called viral distance Y between the DNA of 
the infecting virus and the virus in the vaccine could be considered as a continuous mark variable. 
In general, the time X to HIV infection is subject to interval censoring case k, with current status 
censoring (or interval censoring case 1) as a special instance. 



The MLE in the CSCM model is inconsistent and Maathuis and Wellner (2008) obtain a con- 
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sistent estimator by discretizing the mark variable to K levels. The resulting observations can 
then be viewed as observations from the current status ET-competing risk model. Apart from con- 
sistency, global and local asymptotic distribution properties for the MLE in the latter model are 
proved in Groeneboom et al. (2008a, 2008b). Asymptotic results for K — > oo as n — > oo are not 



yet known. Another approach to obtain a consistent estimator is adopted in Groeneboom et al. 



(2011). There a plug-in inverse estimator for the bivariate distribution function Fq, using kernel 
estimators, is studied and its asymptotic distribution is derived. In contrast to the proposed ap- 
proach of Maathuis and Wellner (2008), this estimator does have a Lebesgue density on [0,oo) 2 . 
Unfortunately, for finite sample size n, this estimator does not necessarily satisfy the conditions of 
a bivariate distribution function (i.e. each rectangle has nonnegative mass). If estimators are to 
be used in bootstrap experiments, this is a serious drawback, since it is not clear how to interpret 
sampling from such a "distribution" . 

In this paper we consider an alternative method, the method of maximum smoothed likelihood. 



This is a natural approach since also in other models where MLE's are inconsistent (Jongbloed 



(2009)) or nonsmooth (Groeneboom et al. (2010)), maximum smoothed likelihood estimators 



(MSLEs) provide consistent and smooth estimators. The basic idea is to replace the empirical 
distribution function in the log- likelihood by a smooth estimator. We prove that for a histogram- 
type smoothing of the observation distribution the resulting MSLE is consistent under certain 
conditions. Contrary to the plug-in inverse estimator studied by Groeneboom et al. (2011) the 
MSLE is a real distribution function. 

The outline of this paper is as follows. In section [2] we introduce the CSCM model in more 
detail and define the MSLEs F^ IS and f^f for the distribution function Fq and its density. In 
section jij consistency of the bivariate estimator F^ IS and the marginal estimator F^fjj! for the 
distribution function of X are proved. A comparative simulation study is presented in section [4j 
Technical proofs and lemmas are given in appendix [Aj and in Appendix [B] we also give a desription 
of an easy to implement EM algorithm for computing the MSLE. 



2 Model description and definition of the estimator 

In this section we describe the current status continuous mark model in more detail and define the 
maximum smoothed likelihood estimator (MSLE) /„ for the bivariate density /o- The smoothed 
log-likelihood is obtained by replacing the empirical distribution function in the definition of the 
log-likelihood by a smooth estimator H n for the distribution function. We prove that for piecewise 
constant density estimates h n the estimator f£f s exists and is unique under certain conditions. 
Based on f^f s we also define the MSLE for the bivariate distribution function Fq and for the 
marginal distribution function Fq^x of X. 

Consider an i.i.d. sequence (Xi, Yx), (X2, Y2), . . . with bivariate distribution function Fq on W = 
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[0, oo) x [0,oo) and independent of this an i.i.d. sequence Ti,T2, . . . with distribution function G 
and Lebesgue density g on [0, 00). Based on these sequences, define A, = ljx^TJ an d W% = 
(Tj,Aj • Yj) =: (Ti,Zi), where we assume that P{Yi = 0) = 0. In words: if the event already 
occurred before time Tj, the mark variable Yi is observed; if not, Y^ is not observed. Note that 
A i = 1 {Z l >0}- 

Let Fo t x{i) = Jo Jq° fo{ u i v ) dv du be the marginal distribution function of X and define 
d 2 Fo(t, z ) = §i^o(ti z) = f fo(u, z) du. Then W±, W 2 , ■ ■ ■ are i.i.d. and have density 

h f (t,z) = ^{z>o}(z)9(t)d 2 Fo(t,z) + l {z=0} (z)g(t)(l - F QtX (t)) =: l {z>0} (z)/ii(t, z) + l {z=0} (z)h (t), 

with respect to the measure A on [0,oo) 2 defined below. Let be Lebesgue-measure on 1R 1 , B the 
Borel cr-algebra on [0, oo) 2 , then the measure A is defined by 

\{B) = \ 2 (B)+\ 1 ({x G [0,oo) : (z,0) SB}), B G B. (2.1) 



Let H n be the empirical distribution function of Wi, . . . , W n . |Hudgens et al. (2007) define and 



characterize the nonparametric maximum likelihood estimator (MLE), which maximizes 



1(f) = J logh f (t,z)dM n (t,z) 

= J l {2>0} (z)log (d 2 F(t,z))+l {z=0} (z)log(l-F x (t))dM n (t,z) 



(2.2) 



over the class of distribution functions with density / w.r.t. Ai x counting measure on the observed 



marks, with appropriate interpretation of the partial derivative. However, Maathuis and Wellner 



(2008) prove that this MLE is inconsistent. The heart of the consistency difficulties with the MLE 



resides in the first part of the log-likelihood J \{ z>0 }(z) log d 2 F(t, z) dM n (t, z), where one really has 
to deal with a density type expression in z (i.e., d 2 F(t,z) = f f(s, z) ds) instead of a bivariate 
distribution function. 

We now propose an alternative likelihood-based method, the method of maximum smoothed 



log-likelihood introduced in Eggermont and LaRiccia (2001) , where the resulting estimator for Fq 



will turn out to be consistent. Let H n be a smoothed version of the empirical distribution function 
H n , then the smoothed log-likelihood I s is defined by replacing H n in (2.2) by its smoothed version 
H n , i.e. 

I 8 if) = J l{,>o}(^)log {d 2 F (t,z)) + l {z=0} (z)log(l-F 0iX (t)),dH n (t,z), 

Note that the factorization property of the MLE in the current status model, by which the part 
involving g drops out, also holds in the present case: we do not have to maximize over the unknown 
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g, because it does not play a role in the maximization problem. The maximum smoothed likelihood 
estimator (MSLE) f^ IS for the density fo is then defined as 

/ n M5 = argmax/ 5 (/), (2.3) 

where T is the class of all distribution functions with density / w.r.t. Lebesgue measure on [0, oo) 2 . 
The MSLE for the bivariate distribution function Fq is naturally defined as 

ft l-Z 

?MSi+ „\ _ \ \ IMS i 



F^{t,z) = / / f^(u,v)dvdu, 
Jo Jo 

and the estimators for Fox and c^-^b are defined similarly, 

F™x(t) = f f°° fn S {u, v) dv du, d 2 F^ IS (t, z) = §-/™ S (t, z) = f /„ M5 (u, z) du. 



Jo 



Note that the MSLE /„ can also be seen as a Kullback-Leibler projection, minimizing 
JC(h n ,hf) = I h n (t, z) log ^" * dX(t, z) = [ log h n (t, z) dH n (t, z) - f log h f (t, z) dH n (t, z), 



hf(t,z) 

over densities / E T . This follows from the fact that the first term on the right-hand side does not 
depend on / and the second term on the right-hand side equals —l s (f). 

In this paper, for the ease of computations and proving consistency, we take a histogram-type 
estimator for the density h n of H n , resulting in a piecewise linear estimator H n . There are other 
possibilities as well to choose H n , see section [4J To define our estimator h n we take two binwidths 
5 n and e n and define A n>i = ((i - l)5 n , iS n ] = (a n>i -i, a n>i ] and B n j = ((j - l)e n , je n ) = (6 n ,j-l> K,j] 
for i = 1, . . . , k n , j = 1, . . . , l n . Then slightly abusing notation, the estimator h n is defined and 
denoted by 

h n {t,0) = hi = ^ 1 H n (yl niJ x {0}), if t G A nji 

h n (t, z) = h id = 5- 1 e~ 1 M n (A nti x B nJ ), if (t,z) G A nji x B n>j . 

We consider the estimator /„ that is obtained by maximizing I s '(/) over the class T n of 
piecewise constant densities with cells A n i x B n j. For the resulting histogram-type estimator 
we can prove that it is well defined and unique if all cells contain at least one observation. This 
holds with high probability under certain conditions on the observation density hf and the total 
number k n ■ l n of cells A n ^ x B n j. 



6 



Before proving the existence and uniqueness of f^ s , stated in Theorem 2.2 below, we introduce 



some notation to relate the class of densities we consider to appropriate subsets of Euclidean space. 
B n = {/ e R fc " x/ " : < fa < {SnBn)- 1 Vi, j} (2.4) 

ai(f) = 5 n e n Aj» Mf) = 6 nYl A* for / G B " ( 2 - 5 ) 

i=i 3=1 i=l 

with OlogO := 0, a kn+1 (f) = and /3 j(/) = for all j. 
Lemma 2.1 Define 

^-l(/) =*n5Z^i ^(A,i(/)jA-lj(/)) 
i=l i=l J=l 

i=l 3=1 

J (xlogx - ylogy)/(x - y), ar, y € [0, 1], x ± y . . 

[ 1 + logx, x,y e [0, 1J, x = y, 

then 

argmax/ s (/) = argmax^>_i(/). 
feT n ' feB n 

The proof is given in the Appendix. 

Using this lemma, we can prove the existence and uniqueness of f£f theorem below. 

Theorem 2.2 If hi > and hi j > for alli,j, then the maximizer f^ IS of I s over T n exists and 
is unique. 

Proof: The function ip is continuous on (0, l] 2 so ip—i is continuous on the compact set B n . Hence 
V>-i attains its maximum over B n and f^ ls exists. Uniqueness of f^f s follows from the strict 
concavity of I s '(/) on its domain. □ 



Remark 2.3 If hi = for some % or hij = for some i and j, we can construct examples where 
f£f is not unique. However, we can prove under various conditions on hf , k n and l n that 

P(3i : ^ = 0)— ►(), P{3i,j : h itj = 0)^0, (2.7) 
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so that with probability converging to one f^f is well defined for n sufficiently large. For example, 
if k n = l n and for some c > 



h (t)dt > -2 

A n j. fc n 



for all i and j, then 



Ia, 



hi(t, z) dt dz > 



(2.8) 



P(3 i,j : htj = 0) <J2 P ihi =0)<qi 



1,3 



•0, 



if n 1 kf l logk n ->■ 0. 

A similar argument shows that also P{3 i : hi = 0) < k n (l — p-J — t-0. 

Assume for example that fo has compact support Wm = [0, M\] x [0, M2] for some constants 
< M\,M2 < 00 and stays away from zero on its support, and that g > K\ > on [0, Mi]. 



Then (2.8) is satisfied. This condition is far from necessary, but only meant as an illustration for 



a condition under which (2.8) is satisfied. 



By Theorem 2.2 we know conditions under which the estimator f^f s defined in (|2.3|) exists and 



is unique. A simple EM algorithm for computing the MSLE is given in the appendix. 



3 Consistency of F^f s 

In this section, we prove that h^f s = hp. IS and F^f s are consistent estimators for the density hf 
of the observable vector W and the bivariate distribution function of interest -Fo , respectively. To 
prove this we assume the densities fo and g to satisfy conditions (F.l) and (G.l) below. We also 



assume fo and g are such that hf satisfies the conditions needed for (2.7) so that existence and 
uniqueness of f^ IS are guaranteed with probability converging to one. Furthermore, we assume the 
binwidths 6 n and e„ to satisfy condition (C.l) below. 

(F.l) The density fo has compact support Wm = [0, Mi] x [0, M2] and is continuous on Wm- 

(G.l) The censoring density g is uniformly continuous and bounded away from zero and infinity on 
(0, Mi), i.e. < c g < g(t) < C g < 00 for all t G (0, M x ). 

(C.l) The binwidths 5 n and e n converge to zero such that n5 n e n — > 00 as n — > 00. 



Note that if 6„ 



71 



-1 / 5 and e r 



n 



-1 / 5 , condition (C.l) is satisfied. The choice of 5 n of order 



n 



" 1//5 is probably optimal. One can also choose e n of this order, but it is probably better to choose 



£n *C 5 n , see for a discussion on this matter Groeneboom et al. (2011) Further remarks on the 
problem of binwidth choice can be found in section [4] 
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Figure 2.1: The estimator f^ s (upper panels) and contour plot of F^f s (lower panels) for two 
simulations: fo(x,y) = l[ ,i]2(x, y), g(t) = l[ ,i](*) (left panels) and fo(x,y) = x + y, g(t) = 2t 
(right panels), n = 5 000 and 5 n = e n = 0.2 chosen as illustration. The dash-dotted lines are 
the true distribution functions Fo for these two examples. The levels of the contour plot are 
0.01, 0.05, 0.1, 0.2,..., 1.0 



Lemma 3.1 Let /o and g satisfy conditions (F.l) and (CI) and 5 n , e n condition (C.l). Further- 
more, assume that Jg Ml log ho(t) dt < oo, Jq Mi J" Q Ml log hi (t, z) dz dt < oo. Then hff s is Hellinger- 
consistent for hf , i.e. 



(3.1) 



Proof: We establish (3.1) using relation (A.l) and the property that f^ ls minimizes K,(h n , /i/ ). 
Since /o T n in general, the inequality 

K,(h n ,hjM S ) < IC(h n ,hf ) 

need not hold. In order to exploit the defining minimizing property of f£f s , we define a piecewise 
constant representative /„ o of /o which belongs to T n and approximates /o 



fnfi(t,z) — ^ n l£ n 1 



f (u, v) dv du if (t, z) G A nji x B n ^ 



For this representative it holds that 



< 2n{h n ,h^ Is y < K[h n ,h™ S ) < K{h n ,h InQ ), 



(3.2) 



also using relation (A.l). The Hellinger distance is a metric, hence applying the triangle inequality 
twice gives 



< ■H{h K n IS ,h h ) < H(h™ s ,h n ) + H(h n ,h fn0 ) +U[hj nQ ,h h ). 



(3.3) 



The first term on the right hand side of (3.3) converges in probability to zero by combining (3.2) 



and Lemma A. 3 The second term converges in probability to zero by combining Lemma A. 3 and 



relation (A.l). The third term converges to zero by combining relation (A. 3) and the second result 
in Lemma A.l, hence (3.1) follows. □ 



From Lemma 3.1 it follows that Fjf converges pointwise and in Li-norm to Fq. 



Theorem 3.2 Under the conditions of Lemma 3.1 



\\F^-F 



v 



1 



(3.4) 



Consequently, for all (t, z) £ Wm 



F^(t,z)^F (t,z) 
implying that sup w \F** s (t, z) - F (t, 



V 



0. 



(3.5) 
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Remark 3.3 Because F^(t) = F% lb (t,M 2 ), this lemma implies that \\F™£ - F^ x 



?MS 



MS 



V 



Proof of Theorem 3.2[ By combining Lemma 3.1 and relation (|A.2|), we have that 



h^ s (t,z)-h fo (t,z) d\(t,z) 



Mi r-M 2 



g(t)\d 2 F^ s (t,z)-d 2 F (t,z)\dzdt 



Jo 

Mi 



+ / g(t)\{l - F$(t)) - (1 - F , x (t))\dt < V2n(h^ s ,h fo ) A 0. 
J o 



This implies that 



\d 2 F™ s - d 2 F Q \\ l 



d 2 F™ s (t, z) - d 2 F (t, z)\ dX(t, z) 



v 



0. 



W M 



since g > on (0, Mi). To prove (3.4) note that the Li-distance between F^f s and Fq can be 
bounded by 



1 



W M 



W m 



F^ s (t,z)-F Q (t,z) d\(t,z) 



o 



d 2 F^ s (t,v) -d 2 F (t,v)) dv 



d\{t,z) 



/•Mi /-M 2 /-M 2 

< / / / |d 2 F,f s (M) -d 2 F {t,v)\dzdvdt 

Jt=0 Jv=0 J z=v 

<M 2 \\d 2 F™ s -d 2 F Q \\ x AO. 



To prove (3.5), assume it does not hold for a certain (i, z) 6 Wm, i.e. there exists e > 0, S > 
and a subsequence n*. of n such that for all k £ N 



P(jF^(M)-*b(M)|>*) >e. 
Assume PJf s (t, z) < Fo(t, z) — 5, then there exists a small c > such that 



V (u, v)€[t- cS, t]x[z- cS, z] =: As : F^ k s (u, v) < F (u, v) - ^6, 



by continuity of Fq and monotonicity of F*f s . This implies that for all k 

■2 



P 



> P 



\F^(t,z)-F (t,z)\dzdt>-5 



F^ s (u,z) - F {u,z)\ dzdu > —5 



As 



> P I V (u,v) € As : Ff h s (u, v) < F {u, v )-^5)>e. 
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If F^f (t, z) > Fo(t, z) + 6, we have by a similar argument that 



V (u, v) €[t,t + cS] x [z, z + cS] =: As : F^ s (u, v) > F (u, v) + -5, 



giving that for all k 
P 



F^(t,z)-F (t,z)\dzdt>-5 



> P (V (u,v) € As : F^ k s {u, v) > F (u, v) + ^6 ) > e 



This contradicts (3.4). Strengthening pointwise consistency to uniform consistency over Wm follows 
from monotonicity of F^ s and Fq and the assumed smoothness of Fq. □ 



4 Discussion 

In this paper we have considered consistency of the MSLE, where the observation distribution was 
smoothed by using histogram type estimators. Rigorous derivation of the asymptotic distribution 
is at this moment still not available. Heuristic considerations indicate that, if (toj-^o) is an interior 
point of the support of /o, and the binwidth for the first coordinate satisfies 5 n ~ cin -1 / 5 , whereas 
the binwidth for the second coordinate satisfies n~~ 2 / 5 <C e n <C n -1 / 5 , we get 

n 2 / 5 (F n M5 (t , zo) - F Q (t Q , zo)) - tf(J3, a 2 ), 

where J\F(/3, cr 2 ) is a normal distribution with expectation 

d 1 F (t ,z )g'{t )c 2 1 
P Wh) 

and variance 

2 _ Fo(to,^o)(l-i 71 o(to,^o))\ / 3 



2ctg(t 



This implies that the asymptotically optimal number of cells for the first coordinate would 
satisfy 



r i ^ 3-1/2 ( 2g'(t ) 2 d 1 Fo(t ,z ) 2 
n ^" \g{t )F (t ,z )(l-Fo{t ,z )) 



n 



implying that the optimal number of cells on the first coordinate is rather small for the model on 
which the simulations, reported below, are based. 
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Figure 4.1: Estimates of the function t h-> i*b(ij -2)) where z = 0.6. The MSLE is shown in the left 
panel and the plug-in estimate F n , defined by ( |4.1[ ), in the right panel. The MSLE and F n are the 
piecewise linear solid curves in the pictures and the dashed curves represent the real Fo, where F n 
is linearly extended to the last interval (where it can not be defined by interpolation between values 
at successive points of the grid). Moreover, Fo(t, z) = \tz(t + z), g(t) = 2t, and the sample size for 
which the estimators were computed was n = 5000. The binwidth for the first coordinate was 0.2 
for the MSLE and 0.1 for the plug-in estimator. For the second coordinate we took binwidth 0.2 
for both estimators. 

The behavior of the MSLE F^f s is somewhat similar to that of the plug- in estimator F n , defined 

by 

„ ( h \ - ^..u^.+i^gC^njl dM n {t,z) 
JteA n>i uA n>i+1 daj «W 

at the points (a n> i, b n , ) of the grid, and by linear interpolation elsewhere (except on the last interval 
A n ,k n , where it was just linearly extended), where G n is the empirical distribution function of the 
observations Ti,...,T n . However F n and F^ IS are not asymptotically equivalent, as first was 
noticed in the simulations. We have, as (a nt i,b n j) — > (to,Zo), under the same conditions on the 
binwidth S n and e n as used above, 

n 2 ' 5 (F n (a n>i ,b njj ) - F (a nji ,b n j)) M(/3 2 , erf), 

where 

R \ X &T?( + \ 1 WMM ) 2 2 F (t ,Zo)(l-F (to,Z )) 

= \ ^dfF (t , z ) + — — } cy, a 2 - 



G L uv u ' u/ 3g{to) J z 2c l9 (to 
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This implies that the asymptotic variance is smaller by a factor \/3 than the conjectured asymptotic 
variance of the MSLE. On the other hand, the asymptotic bias is larger than the conjectured bias of 
the MSLE F^ 1S in the model, used in the simulations which produced Table [l] It seems unavoidable 
that the relation between plug-in estimators of this type and our MSLE F^f involves the partial 
derivative t?2-?n , which makes the analysis rather complicated. 

Other smoothing methods are also possible, for example using kernel estimators instead of 
histogram estimators for the smoothing of the observation distribution. However, we do not know 
how to compute the MSLE for this type of smoothing. Using a smoothed MLE (SMLE) is not 
sensible because it inherits the inconsistency of the unsmoothed MLE. 

In Table [j] we compare the local mean squared error (MSE) of the MSLE with the MSE's of 
other comparable estimators. On the second coordinate we took 5 cells for the MSLE, which means 
that the bias on the second coordinate does not play a role, since 0.6 is then a point of the grid 
for the second coordinate, and on the first coordinate we took the number of cells between 4 (for 
n = 500) and 7 (for n = 10 000). The results were obtained by generating 10 000 samples for each 
value of (t,z), considered in the table, and each sample size n. We compared the results with the 



MSE's of the plug-in estimator F^, studied in Groeneboom et al. (2011) , and defined by 



zo) 



ze(o,z 



h n (to - u) dB n (u,z) 



f h n (to - u) dG n (u) 



(4.2) 



where A; is a smooth symmetric kernel with support [—1,1], for example the Epanechnikov kernel, 



and S n the bandwidth. Note the similarity between (|4.2|) and (|4.1|). We also included the binned 
MLE of 



Maathuis and Wellner (2008) in our comparison. The values for F^ and the binned 



MLE were taken from Table 5.1, p. 115, Witte (2011), where the bandwidths, resp. binwidth, were 



chosen in such a way that the MSE was minimized. As can be seen from the table, none of the 
four estimators comes out as uniformly best in this situation. 



A Technical lemmas and proofs 

In this section, we prove most of the results stated in the previous sections as well as some technical 
lemmas needed in these proofs. We start with some known results on several distances. 

Let / and g be two probability densities with respect to a dominating measure ^. Let T~L and 
K, denote the Hellinger distance and the Kullback-Leibler divergence between / and g respectively, 
i.e. 

U(f,g) = J\ j {yfm-^gjx)fd^x), K[f,g)= j f(x)\og&-d^(x). 
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Table 1: Estimated values of the MSE for four estimators of Fo{to, 0.6) at a number of values of to- 
The boldfaced values in each row are the minimal values of the MSE in that row. 



n 


MSLE F*f s 


Plug-in F n 






binned MLE 


500 


2.12 x 10~ 3 


1.41 x 10~ 3 


2.81 x 10" 


-3 


7.84 x 10 4 


1000 


1.86 x 10~ 3 


7.73 x 10~ 4 


1.53 x 10" 


-3 


2.01 x 10 4 


5 000 


3.19 x 10~ 4 


1.96 x 10~ 4 


2.04 x 10" 


-4 


1.49 x 10 4 


10 000 


1.35 x 10~ 4 


1.11 x 10" 4 


9.59 x 10" 


-5 


1.13 x 10~ 4 



to 


= 0.4 


500 


8.39 x 10 


-4 


1.25 x 10- 


-3 


9.07 x 10- 


-4 


1.21 x 10- 


-3 






1000 


4.90 x 10 


-4 


7.07 x 10- 


-4 


5.94 x 10- 


-4 


6.74 x 10- 


-4 






5 000 


1.21 x 10 


-4 


1.90 x 10- 


-4 


1.32 x 10- 


-4 


2.37 x 10- 


-4 






10 000 


8.35 x 10 


-5 


1.08 x 10- 


-4 


8.95 x 10- 


-5 


1.35 x 10- 


-4 


to 


= 0.6 


500 


6.32 x 10 


-4 


1.17 x 10- 


-3 


8.21 x 10- 


-4 


1.38 x 10- 


-3 






1000 


3.71 x 10 


-4 


6.86 x 10- 


-4 


5.31 x 10- 


-4 


7.79 x 10- 


-4 






5 000 


1.48 x 10~ 


-4 


1.86 x 10- 


-4 


1.21 x 10 


-4 


2.11 x 10- 


-4 






10 000 


7.80 x 10 


-5 


1.06 x 10- 


-4 


9.21 x 10- 


-5 


1.31 x 10- 


-4 


to 


= 0.8 


500 


6.71 x 10- 


-4 


9.43 x 10- 


-4 


5.91 x 10 


-4 


1.39 x 10- 


-3 






1000 


5.88 x 10- 


-4 


5.85 x 10- 


-4 


3.14 x 10 


-4 


8.59 x 10- 


-4 






5 000 


9.65 x 10- 


-5 


1.81 x 10- 


-4 


5.61 x 10 


-5 


2.27 x 10- 


-4 






10 000 


5.84 x 10- 


-5 


1.04 x 10- 


-4 


3.25 x 10 


-5 


1.39 x 10- 


-4 



Between T~L, KL and the Li-norm 



we use the following relations 



2H{f,g) 2 <K[f,g), 

n{f,g) 2 < l\\f-g\\i<V2H{f,g), 



(A.l) 
(A.2) 



see e.g., van de Geer (2000) Lemma 1.3 for (A.l) and LeCam (1986) p. 47 for (A.2). If / and g 



have compact support C with finite measure /j,(C) = C < oo, then 



H(f, g y < ||/ - = J c \f(x) - g{x)\dn{x) < C\\f - glU 



(A.3) 



Now, we can turn to the proofs and technical lemmas. 
Proof of Lemma |2.1[ Let / G F n , then for t G A n ^ = (a nj j_i, a nj j], z G B n j = (b n j-\, we 
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can write 



/ f(u, z) dz du = (a n>i - t) ^ e n fij + ^ ^ Ke n fi,j 
-=* ' /2=0 j=i «=i+i j=i 

/•t »-i 
5 2 F(t, z) = / /(u, z) du = ^ S nfl,j + (t- a nti -i)fij, 

Ju=0 l=1 



so that 



log (l - Fx(t)) dt = / log <| ^ ^ S n e n fij + a nji ^ e n /ij - t ^ e n /jj 

-W J A-n,i [j = ll=i+l j=l j=l 

/ / log d 2 F(t, z) dz dt = e n / log I ^ 8 n f hj - a n ^if id + /jjt \ dt. 
J A n ,i J B n j J A„ } i ( 1=1 J 



Since we have for < a < b < oo, a / and r > —era 



/ log(r + at) dt = — [u log u — u] 

J a ® 



r+ab 
u=T+aa 



— (r + ab) log (r + crfr) — — (r + era) log (r + era) 
<T cr 



we get 



/ log { 



j=ll=i+l j=l j=l 



e n Si=i fid 



<*"n. L ri 



l=i+l j=l 

'■'77. l T7. 



" <$ n £n /ij log S n E n ^ Yl ]( ~ 6r > 



l=i j=l 



l=i 3=1 



and 



£n / lo S \ X] Sn f l >3 ~ a n,i~lfid + fid 1 \ dt 



'i-1 



i-1 



,1=1 
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so that 



fen 



1 £ " S/=l fi,j 



5 n £n E E fi,j log 5 n e n E E ^' 



1,3 



+ EE 



SnSn E E hi I bg ( Sn£n Y Y h'j 

l=i j=l 

/ i-1 



li=l ^ 



«=« i=i 

<*« E A* l0g 5 ™ E fh 



i-1 



3 



1=1 j- 
h h 1 

- E ~ E E ^ e nhi,j- 

i=l j=l 



1=1 



s n Y fa J log y n Y fa 



1=1 



1=1 



The last two terms can be left out in the maximization, since they do not depend on /. Now, 
taking aj(/) and Pij(f) as in (2.5) we have that 



In 



-n ^2 foj 



3=1 

so that I s '(/) = for 



~(a i+1 (f) - «,(/)), f id = 1 - , 



^(/) = <^E^ f{a i+1 (f),ai(f)) +5 n e n YY" hi 'i ^(AjCf), A-l,i(/)) 

i=l i=l i=l 



with 92 as defined in (2.6), hence the maximizer of I s (f) is the maximizer of ip(f)- 
The estimator f^f s has to satisfy the following conditions 

i=l j=l 

(S.2) Vt,j : /<j >0. 

To get condition (5.1) in the objective function tp{f), we include a Langrange multiplier A £ E 
and maximize 

i=i j=i 

over V n = {/ G ]R, fe ™<™ : > Vi,j}. For any function / satisfying condition (5.1) and each A, 
4>\(f) equals ip{f) and for f\ = argmax /6 p n ip\(f) we have 



= lim r 1 (Va ((1 + 7) A) - (/a)) = 1 + ™ n e n Y Y kiJ- 



i=l j=l 
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This means that if we take A = — 1, the global maximizer of ip\{f) over V n is contained in Tn, so 
that 

arg max tp_i(f) = arg max tp(f). 

Since we also have that F n C B n , it follows that argmaxj g -p n V-i(/) £ &n, hence 
arg max I ( f) = arg max ib_\ ( f). 

□ 



The piecewise constant representative f n fl and the corresponding hj nQ converge to the true /o 



and hf under condition (C.l). This is stated in Lemma A.l below. 

Lemma A.l Let fo and g satisfy conditions (F.l) and (G.l) and 5 n , e n condition (C.l), then 



Wfnfi - M 



0. 



\ h fn,0 - h fo\\oo 



0. 



Proof: For the proof of the first result, note that for (t, z) £ A n ^i x B 



1 1 -j 



\fnfl(t,z) - fo{t,z)\ 



fo(u,v)du dv - f (t,z) 
\fo(u,v) - fo(t,z)\ dudv 



< max \f (u,v) - f (t,z)\. 

{u,v)eA n ^xB n j 

Using (C.l) and the uniform continuity of /o on Wm we get for (t, z) G Wm 

\fn,o(t,z) - fo(t,z)\ < max max \f (u, v) - f (t, z)\ — >0, 

uniformly in (i, z) as n — > oo. This implies the first result. 

For the proof of the second result, note that for (i, z) G Wm 

g(t)\ l {z>0} / {fnfi(u, Z) - f (u,z)}du 
I Ju=0 

/•Mi rhh 

+l{ z =o} / {f n ,o(u, z) - f (u, z)} dz du 

J u=t J 2=0 

<\\g\\oo {Mi||/„, - /olloo + M 1 M 2 \\f n ,0 ~ /olloo} ■ 

Since this upper bound does not depend on t and z, the second result now follows from the first. □ 
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Lemma A. 2 Define for (t, z) £ A n ,i x i? ni 



h n (t,z) = Bh n (t,z) = 6 n e n / / hf (u,v)dX(u,v), 



then, under the conditions of Lemma A.l 



hr — h r 

Jn,0 % 



0. 



(A.4) 



Proof: First, note that 



l^/n, ~~ h n \\oc < \\hj nQ - /i/Joo + \\hf - h n \\oo- 



The first term converges to zero by Lemma A.l We now prove that the second term also converges 
to zero. To see this, note that 

\\hf — hnWoo < \\h\ — /il,n||oo + 11^-0 _ ^0,n.||oo- 



Similarly as the first result in Lemma A.l we have for (t, z) G A n ^ x B n 



\hi(t,z) - hi >n (t,z)\ 



rV 1 



-1^-1 



< max 

(u,v)eA n! iXB n>j 



[hi (t, z) — hi (u, v)) dv du 

hi (t,z) — hi (u, v) | dv du 
hi(t,z) - hi(u,v)\. 



Both g and c^-Fo are uniformly continuous, hence hi is as well and with condition (C.l) we get for 

(i, z) e W M 



\h\(t, z) — hi n (t, z) < max max \hi(t, z) — hi(u, v)\ 

i,j (u,v)eA nti xB nt j 
uniformly in (t, z) as n — > oo. Via a similar argument we get that 



\ho(t) — ho, n (t)\ < max max \ho(t) — ho(u)\ — >0, 



uniformly in t as n — > oo, hence ||/i/ — /i n ||oo — > 0. 



0, 



□ 



Lemma A. 3 Under the conditions of Lemma 3.1 such that hf satisfies (2.8) 



(A.5) 
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Proof: We can write 



h n (t, z) log j hn ^. Z \ dX(t, z) 
h n (t, z) log ^ n ^ Z \ d\(t, z) - 



h n (t, z) log ^ n ^ Z \ d\(t, z) 

tlf (t,Z) 



(A.6) 



hj Q (t,z) 
h n (t, z) log -"' dX(t, z). 
W M n n (t,z) 



The expectation of the first term converges to zero by Barron et al. (1992) Theorem 5 with 
/i = Hf Q (the distribution function of the observable vector W), v = X as defined in (2.1) and 
V = {[0,Mi] x {0}, [0,Mi] x (0,M 2 }}. 

By Fubini's theorem, the expectation of the second term equals 



EA n (M)log^4dA(M) 



h n (t, z) log f - "^' — dX(t, z) = K(hn,hf ). 
w M h fo {t,z) 



This converges to zero by Barron et al. (1992) Theorem 4, so also the expectation of the second 
term in (A.6) converges to zero. 



By (2.8), h n (t,z) > > for all (t, z) G Wjw, so that (A. 4) implies that for any e > and n 



sufficiently large 
1 

Then 



e_ < 1 | h fnfi (t,z)-h n {t,z) e 
c /i h n (t,z) Cfr 



log 1 



/i n (t,z)log 1 



dX(t,z) 



< 



< 



h n (t,z) log 1 + 



h n (t,z) 



dX(t, z) 



Wm 



h n (t, z) log ( 1 + — ) dA(t, 2) = log ( 1 + 



so that also the expectation of the third term in (A.6) converges to zero. Therefore, the expectation 

□ 



of K, (h n ,hf n0 ) converges to zero, and because K,{h n ,hj n ^) > a.s. the convergence in (A. 5) now 
follows. 



B The EM algorithm 

def 

Let, as before, H n denote the smoothed H n , using the histograms on the rectangles Rij = A n ^ x 
B n j of the grid. The MSLE has to maximize 

logd 2 F(t,z)dH n (t,z)+ f log(l-F x (t))dH n (t,0), 



z>0 
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where 

d 2 F(t, z) = 8-' I J2 hi + ' 'A; | , if (t, z) e Ri d , 

{l:a„ } i<t " J 

l-F x (t)= Yl E^ + ^f^E/^ ^teA n>i , 

i:a n ,;_i>t j=l n j=l 

and Ylij fi,j = 1- Note that we do not parametrize by the densities, but by the total mass fij of 
the distribution on a cell = A ni j x B n j. This amounts to the same for this model, however. 

The .E-step, if t € A n ^ and z G B n j, and z > 0, is given by 
E(\ogf(X,Z)\T = t,Z = z) 



Am) t-a n ,j-\ Am) 

{kj j f ■ _| Sn -M l oe; f. 

Am) t-o w ,i-i Am) ^ Jk ^ ^ Jm) t-a n ,i-i >W 

fc<i 2-^i.a n j<t J l,j <5 n JiJ 2^l:a n j<t J l,j S„ Ji,j 



after the mth iteration, and if z = we get after the mth iteration 
E(log/(X,Z)|T = t,Z = 0) 

p(m) a n ,i-t p(m) 

fc>i /-jl:a n> i-i>t r I "i" <5 n r i 2^:<j n> i_i >i M ' 5 n i 

where Fj = ^j=i/ij- We have to integrate this over (£, z) G i^j w.r.t. the density /i n , and then, 
in the M-step, we have to maximize the resulting expression w.r.t. foj. This leads to the follow- 
ing combined F-step and (approximate) M-step (corresponding to the so-called "self-consistency 
equations" ) 

J k,j ~ 2^ / . a ^ Am) x . ,. ,A m ) ' 3 

i>k JteA nti ,zeB nJ f >5 n + {t- an,i-i)fl j 



f (t- an,fc)/i m) 

+ / ~ 7^ '~ — ~ ; — ?z^\ hk,i dtdv 

Jtt 



' . r-v A m ) X ( 4- \ f{ m ) 

tdA nik ,zdB n>J 2^l:a n}l <t Jl,j d n + [t ~ On,k-l)Jk,j 

(m) 



+ Y f r-Mt 

f£ JteA^zeB^ Z l:an l _ 1>t F™6 n + (a n>i - t)F t {m) 

+ / 7 — n 7 — \ lik «£j 
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where hij is the value of h n (t, z) if (t, z) G Rij and /ij is the value of h n (t, 0) if t € A nj j. Hence 



^EM^'SVEtf 



(m) 



f (m) 



i>fc 



(m) 



P (m) 



(m) 



^,3 



+ //& ) E 1 °g( 1 + ^ (m) /E^ 



(m) 



-,(m) 



Z<fc 

(m) 



(/SO 



S n 



p (m) 



(m) 



,-,(m) 



Z>fc 



(**•>) 



2 u n, 



for 1 < k < k n . For k = 1 we get 



/i7 +i) =^,a^ + /#> e bg ( i + f ™ i y: ft 



f (m) 



/?,; 



i>l 



Ki 



(m) 5n£n 



+ /&> ^ (m) - E F / m) lo g ( 1 + F i (m) / E F i 



(m) 



Z>1 



1>1 



2 



and for k = k n 



,(m+l) _j:(m) < 



k n ,j 



(m) 



f (m) 



( Am)\ 
\Jk n ,j) 



: S>n,£r, 



+/SE 1 °g( 1 +t , /E f ; 



(m) 



i(m) 



l>i 



F 



hi ( m ) h kn 



(m)' 



These iterations were used until the absolute value of the scalar product of the vector of values 
fiT w ^ vector °f va l ues °f partial derivatives of the criterion function w.r.t. fj\^ was smaller 
than 1CT 10 (here we use the so-called Fenchel duality condition). The algorithm is very fast and 
can easily be used for simulation purposes, also with sample sizes like n = 10 000. 

Note that 



H 1+ ^VE« 



Am) 



h 



Ki 



J l 



h3 

(m) 



h 



1,3 



47VECi°' 



(m) 



Sz<i // 



(m)' 



Ki J l,j 



Ki 



and 



/^-E/S^h+zSVE/* 



(m) 



Am) 
'1,3 



l k,j 



Z<fc 



Z<fc 



(/«) 2 El<l /S ,r 



(m) 



Z<fc 



which allows individual fj^ to tend to zero during the iterations. Similar relations hold for the 



,(m) 
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